ABSTRACT

Complex software systems are among the most sophisticated
human-made systems, yet little is known about the actual
structure of 'good' software. We here study different
software systems developed in Java from the perspective of
network science. The study reveals that network theory can
provide a prominent set of techniques for the exploratory
analysis of large complex software systems. We further
identify several applications in software engineering, and
propose different network-based quality indicators that
address software design, efficiency, reusability,
vulnerability, controllability and others. We also highlight
various interesting findings, e.g., software systems are
highly vulnerable to processes like bug propagation, yet
they are not easily controllable.

Categories and Subject Descriptors 

D.2.8 [Software Engineering]: Metrics - complexity measures,
performance measures, software science

General Terms 

Theory, algorithms, experimentation. 

Keywords 

Software systems, Software engineering, Software networks, 
Network analysis. 

1. INTRODUCTION 

Complex software systems are among the most sophisticated
systems ever created by humans. Nevertheless, little is
known about the actual structure and quantitative properties
of large software systems [6]. For instance, in the context
of software engineering, one is interested in what 'good'
software looks like. Commonly adopted approaches and
techniques fail to give a comprehensive answer [5] [7].
Moreover, to our knowledge, there is also a lack of a simple
yet rigorous framework for software analysis. This dilemma
has been termed the software law problem [6], which calls for
identifying (physical) laws obeyed by software systems that
could be used in practical applications.

Networks possibly provide the most adequate framework
for the analysis of the structure of complex systems like
software projects.¹ Also, due to their simple and
intelligible form, analysis of different networks has already
provided several significant discoveries in the last decade
[46] [3] [16] [23]. Note that the adoption of software networks
is not novel [35] [27] [19] [39]; however, network analysis is
still only rarely used in software engineering. The main
purpose of this study is thus to highlight different
techniques developed in the field of network analysis, and to
expose their use in software comprehension, development and
engineering. We review most of the past work on different
types of software networks, and also include network analysis
techniques proposed just recently [23] [44]. (Note that the
main focus of the paper is merely a review, rather than a
detailed comparison of network analysis techniques with other
approaches.)

The study in the paper analyses software networks on
different levels of granularity. First, we address the
macroscopic properties of software networks like scale-free
and small-world phenomena [46] [3] that are related to the
structure and design of the entire project, or projects,
represented by the network. Second, we analyze the
microscopic properties of individual nodes, with special
emphasis on different dynamical processes occurring on
software networks like bug propagation [5] [30]. The above can
be related to software quality, complexity, reusability,
robustness, vulnerability and controllability. Third, we also
identify mesoscopic structural modules within software
networks [16] [44] and show their applicability in the context
of software abstraction and refactoring. The paper thus
exposes network analysis as a prominent set of techniques for
software engineering.

The rest of the paper is structured as follows. Section 2
introduces software networks used in the study. Section 3
analyzes different characteristics of adopted networks and
discusses their use in software engineering. Some
applications of the presented techniques are given in
Section 4, while Section 5 concludes the paper.

2. SOFTWARE NETWORKS 

Various types of networks have been proposed for the
analysis of the structure of complex software systems, for
instance, software architecture maps [35], software mirror
graphs [6], class, method and package collaboration graphs
[17], subroutine call graphs [27], inter-package dependency
networks [21], software class diagrams [37] and class
dependency networks [39], to name just a few. Networks differ
mainly in whether they are constructed from source code, byte
code or program execution traces, in the level of software
architecture represented by the nodes, and in the set of
interdependencies represented by the links.

1 Throughout the paper, the term project refers to a repository of software code.

Figure 1: (left) A simple Java class and the corresponding part of class dependency network. Direction of links is (mostly) just the opposite to the flow of information. (right) Class dependency network of java (circles) and javax (triangles) namespaces of Java language.

Table 1: Properties of class dependency networks used in the study.

For consistency with some previous work [4] [17] [47] [39],
we construct networks from the source code of different Java
projects² (Table 1). Due to the object-oriented view of
Java language, nodes in the network can represent either
project packages, software classes, methods and functions
or individual lines of code. We here adopt class dependency
networks [39], where nodes represent classes and links
correspond to different dependencies among them (Figure 1).
This choice is based on the following reasons. First, as
networks are constructed merely from the signatures of
different classes, and functions and fields therein, they are
only mildly influenced by the subjective nature of each
individual developer. (This can be more adequately modeled
by, e.g., text mining applied to the names of different
programming constructs [20].) Second, mesoscopic structures of
class dependency networks coincide with project packages,
which enables various applications in software engineering
(Section 4). Third, such networks relate to the information
flow between different parts of a software project, and also
coincide with the human comprehension of object-oriented
software systems.

Note that class dependency networks address only the
inter-class structure of the software project, whereas the
intra-class dependencies are disregarded. However, similarly
as above, the latter also reflect the programming style of a
particular developer, rather than the intrinsic structure of
the software project alone. Nevertheless, future work will
extend the study to inter- and intra-class dependencies using
the concepts of interdependent or coupled networks [26].

Formally, let a project consist of classes
$A = \{A_1, A_2, \dots\}$ and let $P$ be the set of software
packages (bottom-most level of the package hierarchy). The
corresponding class dependency network is then a directed
graph $G(N, L)$, with nodes $N = \{1, \dots, n\}$ and links $L$
($m = |L|$). Node $i$ corresponds to a class $A_i$; however,
since isolated nodes are discarded in the analysis,
$n \leq |A|$. A directed link $(i, j) \in L$ represents some
dependency between classes $A_i$ and $A_j$: inheritance
($A_i$ inherits or implements $A_j$), parameter ($A_i$
contains a method, function or constructor that takes $A_j$ as
parameter), return ($A_i$ contains a method or function that
returns $A_j$) and field ($A_i$ contains a field of type
$A_j$). Denote $k$ to be the average degree in the network
(i.e., average number of links incident to a node).
Furthermore, let $k^{in}$ and $k^{out}$ be the average
in-degree and out-degree of the nodes, $k = k^{in} + k^{out}$.
Hence, $k_i^{out}$ corresponds to the number of other classes
required to implement the functionality of a respective class
$A_i$, while $k_i^{in}$ corresponds to the number of classes
that use (depend on) $A_i$. Last, denote LCC to be the
fraction of nodes in the largest connected component.³
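The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `degree_statistics` and the edge-list input are hypothetical, the class names in the usage note are made up, and the largest connected component is taken on the undirected version of the network (weak connectivity), which the text does not state explicitly.

```python
from collections import defaultdict, deque


def degree_statistics(links):
    """Basic statistics of a class dependency network given as (i, j) links.

    A link (i, j) means that class i depends on class j.
    """
    links = set(links)  # several dependencies between one pair count once
    if not links:
        raise ValueError("the network has no links")
    # Isolated classes never appear in a link, so they are discarded implicitly.
    nodes = {node for link in links for node in link}
    n, m = len(nodes), len(links)

    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    undirected = defaultdict(set)
    for i, j in links:
        out_degree[i] += 1  # i needs j to implement its functionality
        in_degree[j] += 1   # j is used by i
        undirected[i].add(j)
        undirected[j].add(i)

    # Largest (weakly) connected component, found by breadth-first search.
    seen, largest = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in undirected[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        largest = max(largest, size)

    return {
        "n": n,
        "m": m,
        "k": 2 * m / n,     # average degree, k = k_in + k_out
        "k_in": m / n,      # average in-degree
        "k_out": m / n,     # average out-degree
        "LCC": largest / n,
        "in_degree": dict(in_degree),
        "out_degree": dict(out_degree),
    }
```

For example, `degree_statistics([("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")])` describes a network of five classes and four links in which class `C` is used by two other classes.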

Table 1 shows properties of class dependency networks
used in the study. Networks were selected to represent
a diverse set of software systems including utility libraries
(e.g., flmng and colt networks), complete frameworks (e.g.,
jung and weka networks) and also the core of Java language
itself (i.e., java network).

2 Networks are available from http://lovro.lpt.fri.uni-lj.si/.

3 All networks in figures are reduced to LCCs.

Figure 2: Degree distributions of weka, javax and java networks. (Panels: degree, out-degree and in-degree; horizontal axis: node degree.)

Table 2: Different statistics for class dependency networks used in the study.

Software networks are compared against Erdős-Rényi random
graphs [12], where a link is placed between each pair of
$n$ nodes with probability $k/(n - 1)$, where $k = 2m/n$ for
some $n$ and $m$.
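A minimal sketch of this null model follows, assuming an undirected graph on nodes `0, ..., n-1`; the function name `erdos_renyi` and the `seed` parameter are hypothetical and only serve the illustration. Since there are $n(n-1)/2$ node pairs, the expected number of links equals $m$.

```python
import random
from itertools import combinations


def erdos_renyi(n, m, seed=None):
    """Random graph with n nodes and, in expectation, m undirected links."""
    rng = random.Random(seed)
    k = 2 * m / n        # average degree of the network being matched
    p = k / (n - 1)      # probability of a link between any pair of nodes
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]
```

Statistics such as clustering or average distance are then averaged over several realizations before being compared with the values of the software network.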

3. ANALYSIS AND DISCUSSION 

3.1 Scale-free networks - software complexity 
and reusability 

Simple random graphs experience a Poisson degree
distribution $p_k$, $p_k \sim \frac{\lambda^k e^{-\lambda}}{k!}$,
where the mean $\lambda$ equals the average degree of the
graph. On the contrary, $p_k$ of most real-world networks
including software networks follows a power-law form
$p_k \sim k^{-\gamma}$ [3] [35] [31] [9], where $\gamma$ is a
scale-free exponent, $\gamma > 1$. The latter can be clearly
observed by a straight line with slope $-\gamma$ in a log-log
plot (Figure 2). Networks with power-law degree distributions
are denoted scale-free, while $\gamma$ can be directly related
to the spreading processes occurring on networks [32] (e.g.,
bug propagation). For $\gamma \in (2, 3)$, even a very small
fraction of faulty nodes can already render the entire system
inapplicable [30] [32]. Unfortunately, the latter applies to
all software networks used here (Table 2).
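The text does not say how $\gamma$ was estimated. As one possibility, the sketch below uses the common maximum-likelihood approximation for discrete power-law data, $\gamma \approx 1 + n' / \sum_i \ln\frac{k_i}{k_{min} - 1/2}$, taken over the $n'$ nodes with degree at least $k_{min}$. The function name `estimate_gamma` and the cutoff parameter `k_min` are assumptions made for this example and are not taken from the paper.

```python
import math


def estimate_gamma(degrees, k_min=1):
    """Approximate maximum-likelihood estimate of the scale-free exponent."""
    if k_min < 1:
        raise ValueError("k_min must be at least 1")
    # Only the tail of the distribution (degrees >= k_min) enters the fit.
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum
```

Applying the estimator separately to in-degrees and out-degrees gives one exponent per distribution; a straight-line fit in the log-log plot is a simpler but less reliable alternative.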

Scale-free networks are usually considered an artifact of
Yule's process or the rich-get-richer phenomenon [3]. For
class dependency networks, this refers to the fact that highly
used classes are, obviously, well known among developers, and
would thus also be more commonly adopted in the future.
Power laws should therefore emerge only in the in-degree
distribution $p_k^{in}$ that refers to the number of times
each class is used [36] [4] (Figure 2). More precisely, the
scale-free nature of $p_k^{in}$ is a result of high code
reusability. On the other hand, the out-degree distribution
$p_k^{out}$ is related to software complexity, since classes
with high $k_i^{out}$ encompass the most complex
functionality. Here, complexity refers to the number of other
classes needed to implement the functionality of the
respective class. For example, the most commonly reused class
in java network is String, whereas FileDialog is the most
complex one (Table 3).

A well developed software project should thus exhibit a
scale-free $p_k^{in}$ and a highly truncated $p_k^{out}$. Next,
lower $\gamma$ indicates higher code reuse, which also
decreases the probability of fault propagation throughout the
system. Last, classes with very high $k_i^{out}$, and also
$k_i^{in}$, should be implemented with extra care (see
Section 3.3).



3.2 Small-world networks - software structure
and design

Software networks exhibit small-world phenomena [46] (see
[27] [38] and Table 2), which refers to high clustering $C$
[46] and very short average distance between the nodes $l$ [1]
(also known as six degrees of separation [25]). $C$ measures
transitivity in the network, and is defined as the probability
that two neighbors of a node are also linked, $C \in [0, 1]$.
$l = \frac{1}{n(n-1)} \sum_{i \neq j} d_{ij}$, where $d_{ij}$
is the distance between $i$ and $j$ in the respective
undirected network (i.e., number of links in the shortest
path). Small-world networks most commonly refer to
$C \gg C_{ER}$ and $l \approx l_{ER}$ [46], where $C_{ER}$ and
$l_{ER}$ are the values for a corresponding random graph.
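Both quantities can be computed directly from the definitions, as in the sketch below. It rests on two assumptions that go beyond the text: $C$ is taken as the average of the per-node clustering coefficients, with nodes of degree below two contributing zero, and $l$ is only defined for a connected network, so the input should first be reduced to its largest connected component. All function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def to_undirected(links):
    """Adjacency sets of the undirected version of a directed link list."""
    adj = {}
    for i, j in links:
        if i == j:
            continue  # self-dependencies carry no structural information here
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)
    return adj


def clustering(adj):
    """Average probability that two neighbors of a node are also linked."""
    total = 0.0
    for neighbors in adj.values():
        if len(neighbors) < 2:
            continue  # fewer than two neighbors: contributes zero
        pairs = len(neighbors) * (len(neighbors) - 1) / 2
        linked = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
        total += linked / pairs
    return total / len(adj)


def average_distance(adj):
    """Mean shortest-path length l over all ordered pairs of distinct nodes."""
    n = len(adj)
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != n:
            raise ValueError("network is not connected; reduce it to the LCC")
        total += sum(dist.values())
    return total / (n * (n - 1))
```

Computing the same two values on random graphs of equal size and density yields $C_{ER}$ and $l_{ER}$ for the comparison described above.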

Clustering of software networks can be related to intrinsic
characteristics of the underlying systems [43]. For instance,
visualization classes usually experience very high clustering,
while clustering is almost zero for I/O classes [43] [40].

Average distance $l$ is an important indicator of the
structural design of the project, or projects, represented by
the network. More precisely, since $l \approx l_{ER}$ is
expected for a small-world network, $l \gg l_{ER}$ indicates
that the underlying software system is divided into several
independent parts with rather different functionality
(Figure 3). Note also that software networks should never be
combined with the core of the language, since the latter
completely obscures their structure and dynamics.

Figure 3: A random graph, jung network, jung & colt network and jung & java network. Average distance between the nodes $l$ equals 3.88, 4.19, 5.37 and 2.18, respectively. Node symbols correspond to clustering D [33] that ranges between 0 (triangles) and 1 (circles).

Figure 4: weka, javax and java networks with highlighted seed nodes.

It ought to be mentioned that software networks are
small-world only in the undirected case [19]. The contrary
would imply a cyclic flow of information within the software
project. For instance, high-level Java class String does not
use the functionality of a lower-level FileDialog. Let $E$ be
the efficiency of network information flow [22] defined as
$E = \frac{1}{n(n-1)} \sum_{i \neq j} \frac{1}{d'_{ij}}$, where
$d'_{ij}$ is the distance from $i$ to $j$ in a (directed)
network, $E \in [0, 1]$. Small worlds should result in high
flow efficiency $E$; however, software networks have
$E \approx 0$ (Table 2).
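The following sketch evaluates $E$ on a directed link list. It assumes the usual convention that a pair with no directed path from $i$ to $j$ has infinite distance and therefore contributes zero to the sum; the function name `flow_efficiency` is hypothetical.

```python
from collections import deque


def flow_efficiency(links):
    """Efficiency E of information flow in a directed network."""
    successors = {}
    for i, j in links:
        successors.setdefault(i, set()).add(j)
        successors.setdefault(j, set())
    n = len(successors)
    total = 0.0
    for source in successors:
        # Directed breadth-first search gives d'_ij for all reachable j.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in successors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Unreachable nodes are absent from dist and so add nothing.
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))
```

A directed cycle scores much higher than a chain of the same length, which illustrates why an acyclic flow of dependencies keeps $E$ low.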

A well designed software project should thus experience
$C \gg C_{ER}$, $l \approx l_{ER}$ and $E \approx 0$. Also, one
should be wary of $l \gg l_{ER}$ throughout the project
development.

3.3 Network nodes - software vulnerability and 
control 

In the context of spreading processes on software networks
[28] [45] (e.g., bug propagation) and network robustness
[2] [32] (i.e., software vulnerability), one is interested in
so-called seed nodes that could originate the propagation of
faults through the entire system.⁴ Centrality metrics that
measure node influence are commonly regarded as a prominent
indicator of seed nodes. Denote $DC_i$ to be the degree
centrality defined as $DC_i = k_i/(n - 1)$, where $k_i$ is the
degree of node $i$, $DC_i \in [0, 1]$. Next, denote $CC_i$ to
be the harmonic closeness centrality defined as the average
inverse of distance from $i$ to the rest of the nodes,
$CC_i = \frac{1}{n-1} \sum_{j \neq i} \frac{1}{d'_{ij}}$,
$CC_i \in [0, 1]$. Last, denote $BC_i$ to be the betweenness
centrality defined as the fraction of shortest paths between
the nodes that go through $i$, $BC_i \in [0, 1]$.

4 Although a poor implementation of any software class already makes the system vulnerable, the problem is even amplified in the case of, e.g., highly reused classes.
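Degree and harmonic closeness centrality can be sketched as follows; betweenness is omitted for brevity. Two assumptions are made that the text does not settle: $k_i$ counts both incoming and outgoing links, and distances in $CC_i$ follow the direction of the links, with unreachable nodes contributing zero. The function name `centralities` is hypothetical.

```python
from collections import deque


def centralities(links):
    """Degree centrality DC_i and harmonic closeness CC_i for every node."""
    successors, degree = {}, {}
    for i, j in set(links):
        successors.setdefault(i, set()).add(j)
        successors.setdefault(j, set())
        degree[i] = degree.get(i, 0) + 1  # outgoing link of i
        degree[j] = degree.get(j, 0) + 1  # incoming link of j
    n = len(successors)

    dc = {i: degree[i] / (n - 1) for i in successors}

    cc = {}
    for source in successors:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search along link directions
            u = queue.popleft()
            for v in successors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        cc[source] = sum(1.0 / d for d in dist.values() if d > 0) / (n - 1)
    return dc, cc
```

Sorting classes by either dictionary gives a ranking of candidate seed nodes; in a chain of dependencies, the class at the start of the chain reaches all others and thus has the highest $CC_i$.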

As $k_i \approx k_i^{in}$ for software networks, $DC_i$
actually identifies classes with the highest code reuse or,
equivalently, high in-degree $k_i^{in}$ (Table 3). A similar
set of influential classes is reported by $BC_i$ (Table 4). On
the other hand, $CC_i$ identifies classes that somewhat
coincide with high complexity classes identified in
Section 3.1. $BC_i$ (and $DC_i$) thus reveals classes whose
faulty implementation could influence the entire system,
whereas $CC_i$ exposes classes that are most prone to an
arbitrary fault within the system. The former commonly reside
in the core of the respective software network, while the
latter are found in the periphery (Figure 4).

Extra care should be taken in the development of classes
with high $BC_i$, while high $CC_i$ classes can be used for
effective, and also efficient, software testing.

Network controllability has just recently been proposed for
the analysis of directed real-world networks [24] [23]. Here,
one is particularly interested in the number of driver nodes
$n_d$ that one has to govern in order to guide the entire
system (i.e., gain control over the output of the system under
the assumption of simple linear transformations). For
scale-free networks with $p_k^{in}$ equal to $p_k^{out}$,
$n_d/n \approx e^{k(\gamma-2)/(2-2\gamma)}$, $\gamma > 2$ [23].
Note that, contrary to seed nodes (Table 4) and general
belief, driver nodes tend to avoid high degree nodes [23] [11].
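The sketch below illustrates both views of $n_d$. The first function simply evaluates the scale-free approximation given above. The second counts driver nodes for a concrete network under the assumption that the structural controllability framework is used, where $n_d$ equals the number of nodes left unmatched by a maximum matching of the directed links (and at least one). The function names are hypothetical, and the simple augmenting-path matching shown here is chosen for brevity, not efficiency.

```python
import math


def driver_fraction_scale_free(k, gamma):
    """Approximate n_d / n for a scale-free network with average degree k."""
    if gamma <= 2:
        raise ValueError("the approximation requires gamma > 2")
    return math.exp(k * (gamma - 2) / (2 - 2 * gamma))


def count_driver_nodes(links):
    """Number of driver nodes via a maximum matching of directed links."""
    successors, nodes = {}, set()
    for i, j in set(links):
        successors.setdefault(i, []).append(j)
        nodes.update((i, j))
    matched_source = {}  # target node -> source node whose link matches it

    def try_augment(source, visited):
        # Search for an augmenting path that starts at an unmatched source.
        for target in successors.get(source, []):
            if target in visited:
                continue
            visited.add(target)
            if (target not in matched_source
                    or try_augment(matched_source[target], visited)):
                matched_source[target] = source
                return True
        return False

    matching = sum(1 for source in successors if try_augment(source, set()))
    # Every unmatched node must be driven directly; at least one driver exists.
    return max(len(nodes) - matching, 1)
```

A chain of dependencies needs a single driver node, whereas a star in which one class points to three others needs three, which is consistent with driver nodes avoiding hubs. The recursion depth grows with path length, so a large network would call for an iterative matching algorithm.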

Most software networks are not highly controllable, since
one would have to manage 30-50% of classes in order to control
the entire project (Table 2). Nevertheless, due to high
density, the core of Java language can be controlled through
merely 17% of classes in java namespace. For comparison,
$n_d/n$ equals $\approx$ 80% for regulatory networks,
$\approx$ 50% for the Internet, $\approx$ 30% for power grids
and on-line social networks, while, interestingly, it is below
3% for corporate ownership networks [23].

Table 3: Hubs (i.e., nodes with very high degree) within weka, javax and java networks.

Table 4: Seed nodes (i.e., very influential nodes) within weka, javax and java networks.

Controllability of a software system can be limited by
decreasing $k$ or $\gamma$, which is achieved by decreasing
code complexity and increasing code reuse (Section 3.1).

3.4 Network modules - software aggregation 
and modularity 

Packages of the software system are reflected in different
structural modules within class dependency networks [39] [44].
For instance, visualization classes commonly aggregate into
communities of densely connected nodes [16], whereas
different parsers, transformers or plugins often arrange into
functional modules [43] that correspond to (disconnected)
groups of nodes with common linkage patterns. Moreover, a
clear community structure signifies a highly modular structure
of the respective software system, while well supported
functional modules are related to clear functional roles of
the classes within the project [39] [44] [43].

Table 5 compares software packages against network modules
identified with MO [8] and CP [42] [44] community detection
approaches, and MM [29] and CP [44] [43] structural module
identification algorithms. The analysis reveals that general
structural modules, which include both communities and
functional modules, most accurately model the package
structure of the software systems in this study.
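The agreement between packages and detected modules is reported as normalized mutual information (Table 5). A minimal sketch of that measure is given below, assuming the common normalization $NMI = 2 I(X; Y) / (H(X) + H(Y))$; the exact variant used in [10] may differ in detail, and the function name is hypothetical. The two arguments list, for each class in the same order, its package label and its module label.

```python
import math
from collections import Counter


def normalized_mutual_information(packages, modules):
    """NMI between two labelings of the same classes, a value in [0, 1]."""
    n = len(packages)
    if n == 0 or n != len(modules):
        raise ValueError("both labelings must cover the same non-empty set")
    count_p = Counter(packages)
    count_m = Counter(modules)
    joint = Counter(zip(packages, modules))

    # Mutual information I(X; Y) from the joint and marginal frequencies.
    mutual = sum(
        (c / n) * math.log(c * n / (count_p[p] * count_m[q]))
        for (p, q), c in joint.items()
    )
    entropy_p = -sum((c / n) * math.log(c / n) for c in count_p.values())
    entropy_m = -sum((c / n) * math.log(c / n) for c in count_m.values())
    if entropy_p + entropy_m == 0:
        return 1.0  # both labelings put every class into a single group
    return 2 * mutual / (entropy_p + entropy_m)
```

Identical partitions give a value of one regardless of how the groups are named, while statistically independent partitions give zero.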

4. APPLICATIONS 

Due to space limitations, the following section only briefly
describes different applications of network analysis
techniques presented in Section 3. Future work will focus on a
more detailed examination and development of supporting
implementations that could be easily applied in practice.



4.1 Software project abstraction 

Figure 5 shows an application of network structural module
detection to software project abstraction. One can identify
an entire hierarchy of modules that is consistent with the
package hierarchy, while also enclosing class dependencies
that go beyond packages decided by the developers. Besides
better comprehension, the revealed hierarchy enables the
prediction of dependencies between the classes of a project
[43].

4.2 Software packages refactoring 

Network module detection algorithms can also be applied
for refactoring of software packages [39] [43]. One can adopt
a community detection algorithm to reveal highly modular
structure (Figure 6 (left)) or a functional module detection
algorithm to identify the underlying functional structure
(Figure 6 (middle)). General structural module detection
algorithms partition software classes according to both
modular and functional links that are present among the
dependencies of the project (Figure 6 (right)).

4.3 Software packages prediction 

Table 6 shows classification accuracies for the prediction
of software packages for the classes of different systems. Let
$i$ be a node corresponding to class $A_i$. The package of
$A_i$ is then predicted to be the most likely package
considering nodes within the same structural module as $i$.
The nodes are weighted according to Jaccard similarity [18],
which is defined as
$|\Gamma_i \cap \Gamma_j| / |\Gamma_i \cup \Gamma_j|$, where
$j$ is a similar node and $\Gamma_i$ is the neighborhood of
node $i$. Structural modules are identified with the algorithm
in [44] [43].
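The prediction step can be sketched as a weighted vote. The text does not spell out how the weights are aggregated, so summing the Jaccard similarities per candidate package is an assumption of this example, as are the function names and the dictionary-based inputs: `neighbors` maps a node to its neighborhood, `module_of` to its structural module, and `package_of` to its known package.

```python
from collections import defaultdict


def jaccard(neighbors_i, neighbors_j):
    """Jaccard similarity of two neighborhoods (sets of nodes)."""
    union = neighbors_i | neighbors_j
    return len(neighbors_i & neighbors_j) / len(union) if union else 0.0


def predict_package(node, neighbors, module_of, package_of):
    """Most likely package of a node, voted by nodes in the same module."""
    votes = defaultdict(float)
    for other, module in module_of.items():
        if other == node or module != module_of[node]:
            continue  # only other members of the same structural module vote
        if other not in package_of:
            continue  # classes with unknown package cannot vote
        votes[package_of[other]] += jaccard(neighbors[node], neighbors[other])
    return max(votes, key=votes.get) if votes else None
```

A class whose neighborhood closely matches that of classes from one package is thus assigned to that package, even when other packages are also represented in its module.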

On average, one can predict software packages with
$\approx$ 80% probability for most classes of the systems
considered, whereas the complete package hierarchy can be
precisely identified for over 60% of the software classes
(Table 6).

4.4 Software quality indicators 

Table 7 and Table 8 show software project and class quality
indicators identified in the study. Indicators can be
employed to assess project structure and design, code
complexity and reusability, controllability and
vulnerability, information flow, and others. Due to space
limitations, comparison with other approaches for measuring
software quality is omitted (e.g., metrics of coupling and
cohesion [34]).

Figure 5: (left) jung network where node symbols represent high-level packages of JUNG framework: visualization (circles), io (triangles), graph (squares) and algorithms (diamonds). (right) Hierarchy of structural modules revealed with the algorithm in [43].

Figure 6: ... structural modules conveying modular and functional links (bottom-most level of the hierarchy in Figure 5).

Table 5: Normalized mutual information [10] (NMI) between software packages and identified network modules, $NMI \in [0, 1]$. Number of modules is shown in small font.

Table 6: Classification accuracy (CA) for software package prediction, $CA \in [0, 1]$. (The table lists the number of levels of the package hierarchy and the average level for a software class. Value under $P_i$ corresponds to CA for the $i$-th level of the hierarchy.)

Table 7: Software project quality indicators presented in the study. For each indicator, we give the range and the expected value of a well designed software system (based on Section 3).

Table 8: Software class indicators presented in the study. For each indicator, we give the range and the expected value of highly influential, most vulnerable or high complexity classes (based on Section 3).

5. CONCLUSIONS 

The paper conducts a comprehensive study of software
networks constructed from Java source code. First, we address
macroscopic network properties that are related to the
structural design of the corresponding software project.
Next, we analyze the networks on a microscopic level of
nodes, to highlight the most influential and vulnerable
software classes. Last, we analyze mesoscopic network
structural modules and expose their applicability in project
refactoring. Among others, we show that software systems are
highly vulnerable to processes like bug propagation, yet they
are not easily controllable. On the other hand, Java language
can be controlled through merely 17% of java namespace. We
also identify several network-based quality indicators that
can be employed to assess software project design,
reusability, robustness, controllability and others. The
study thus exposes network analysis as a prominent set of
tools for software systems engineering.